Abstract

We revisit the mathematical foundations of proper scoring rules (PSRs) 
and Bregman divergences and present their characteristic properties in a 
unified theoretical framework. In many situations it is preferable not to 
generate a PSR directly from its convex entropy on the unit simplex but 
instead by the sublinear extension of the entropy to the positive orthant. 

This gives the scoring rule simply as a subgradient of the extended entropy,
allowing for a more elegant theory. The other convex extensions
of the entropy generate affine extensions of the scoring rule and induce 
the class of functional Bregman divergences. We discuss the geometric 
nature of the relationship between PSRs and Bregman divergences and 
extend and unify existing partial results. We also approach the topic of 
differentiability of entropy functions. Not all entropies of interest possess 
functional derivatives, but they do all have directional derivatives in almost
every direction. Relying on the notion of quasi-interior of a convex
set to quantify the latter property, we formalise under what conditions a 
PSR may be considered to be uniquely determined from its entropy. 

Keywords: proper scoring rule, entropy, Bregman divergence, quasi-interior,
extension, characterisation, derivative, subgradient, convex, sublinear,
homogeneous


1 Introduction 

Proper scoring rules (PSRs) originated in probabilistic forecasting as devices
that assess the quality of forecasts and elicit private information. The subject
has attracted considerable applied and theoretical interest in recent years
(Gneiting and Katzfuss 2014). The present paper focuses on mathematical and
geometric aspects of PSRs and Bregman divergences and elucidates the
relationship between them. Having evolved to a large degree separately, the two
notions have been investigated under restrictive and specialised conditions. We
survey the available literature on this topic and systematise the relevant
results by presenting them in a general and unified theoretical framework. A
more detailed discussion of individual aspects of our review is given in the
subsection below.

First, let us outline how the rest of the paper is organised. In Section 2, 
we discuss the characterisation of PSRs and the related canonical extension of 
the entropy as a sublinear function to the positive orthant. In Section 3, we 
explore more general convex extensions of the entropy function to and beyond 
the positive orthant. This construction generates affine extensions of PSRs, 
also known as affine scoring rules, and induces the class of functional Bregman 
divergences. In Section 4, we examine and generalise some technical results 
about Bregman divergences under regularity conditions that are natural for 
PSRs. We investigate the differentiability properties of entropy functions in 
Section 5. Here, we describe the collection of all PSRs generated by a given 
entropy function and formalise under what conditions this collection contains 
a unique element. In the short Appendix, we present the proof of a technical 
result. 

1.1 Motivation and relation to literature 

The characterisation of PSRs through the 1-homogeneous extension of the
entropy to the positive orthant was first developed by McCarthy 1956 and
Hendrickson and Buehler 1971. A simpler characterisation of PSRs is due to
Savage 1971 and Gneiting and Raftery 2007, who consider entropy functions on
the unit simplex. The unit simplex is, however, a negligible set in measure and
topology, which obscures questions pertaining to regularity and uniqueness of
subgradients, differentiability of entropy functions, etc. On the other hand,
any proper scoring rule on the unit simplex is simply a subgradient relative to
the positive orthant of the 1-homogeneous extension of the entropy. This fact
not only provides us with the means to study the regularity of entropy
functions, but also offers a precise geometric interpretation of the condition
of propriety of a scoring rule. The extension is implicit in the context of
scoring rules that are 0-homogeneous in form, such as the proper local scoring
rules of higher orders (Dawid, Lauritzen, and Parry 2012; Parry, Dawid, and
Lauritzen 2012). A prominent member of that class is the Hyvärinen scoring
rule, which supplies an attractive and statistically consistent alternative to
the method of pseudolikelihood (Dawid and Musio 2014; Dawid and Musio 2012;
Hyvärinen 2005; Hyvärinen 2007; Forbes and Lauritzen 2014). The pseudospherical
scoring rules (Gneiting and Raftery 2007; Dawid 2007) are another important
family of 0-homogeneous scoring rules.

Confining attention only to sublinear extensions of the entropy instead of
the more general convex extensions is too limiting. For example, the simplest
convex extension of the power entropy, corresponding to the power scoring rules,
is the power function. This family is very popular in the meteorological
literature, mainly in terms of the quadratic scoring rule, or the analogous
Brier score (Brier 1950; Gneiting and Katzfuss 2014). The corresponding entropy
is also known as Tsallis entropy, a concept that originates in the physics
literature (Dawid and Musio 2014). The power scoring rules are known for their
robustness properties both under infinitesimal contamination (Basu et al. 1998)
and heavy contamination (Kanamori and Fujisawa 2015). The latter work provides
some practical justification for our interest in extending PSRs to the positive
orthant and beyond, as its methods rely on unnormalised statistical models.

General convex extensions of the entropy naturally appear in the context
of Bayesian games, where the analogous quantities to scoring rules are termed
allocation rules (Frongillo and Kash 2014). In this broader context the authors
introduce the notion of an affine score, which may be visualised geometrically as
a family of supporting hyperplanes to a convex function. This construction
generalises the expected scores of PSRs and induces the class of functional
Bregman divergences. The same structure may be found in the elicitation of
expectiles, and other linear functionals of predictive densities, because the
associated consistent scoring rules have the form of a Bregman divergence
(Abernethy and Frongillo 2012). Convexity plays an important role in more
general elicitation problems (Steinwart et al. 2014; Ziegel 2014; Williamson
2014).

Bregman divergences are central objects in machine learning and statistics 
where they serve as natural generalisations to the Euclidean metric. Their 
properties have been deeply studied on Euclidean spaces (Banerjee et al. 2005; 
Bauschke and Borwein 2001; Boissonnat, Nielsen, and Nock 2010), and partial 
generalisations are available in the context of functional spaces (Frigyik,
Srivastava, and Gupta 2008). The latter work, however, uses assumptions that
are not general enough to include most of the proper scoring rules of practical
interest. In contrast, here we present both notions under unified regularity
conditions and demonstrate that the characterisation of Bregman divergences 
in the Euclidean setting (Banerjee et al. 2005) extends to the present setting. 
Another aspect we investigate here is the well-known fact that the generalised 
quadratic divergence is the only symmetric Bregman divergence on Euclidean 
spaces (Boissonnat, Nielsen, and Nock 2010). We find an analogue of this fact 
in the context of a very general class of functional Bregman divergences. 

It is interesting to understand in what formal sense an entropy function
defines a unique PSR. In finite dimensions, or if the entropy function allows a
continuous extension to an open cone in a normed space, the question may be
resolved with the standard methods of convex analysis (Ovcharov 2014).
Specifically, the entropy function has a unique subgradient at an interior
point of its domain if and only if it is differentiable at that point (Borwein
and Vanderwerff 2010). In infinite dimensions, however, things get complicated
due to the fact that many standard function spaces, such as the Lebesgue spaces
over $\mathbb{R}^n$, are endowed with a positive orthant that has empty
interior. This implies that any extension of the positive orthant to an open
cone will contain densities that change sign. The entropies of many important
scoring rules, such as the logarithmic scoring rule and the proper local scoring
rules of higher orders, cannot be defined for signed densities. Consequently,
these entropies are not differentiable in the standard sense. It turns out that
we may still resolve our question with the help of the notion of quasi-interior.
The latter notion refines the notion of interior of a convex set in infinite
dimensions when the interior is empty. For our purposes, we need the algebraic
version of quasi-interior from Ovcharov 2014, which is analogous to its
better-known topological equivalents (Borwein and Lewis 1992; Fullerton and
Braunschweiger 1963). One of the key results there is the fact that an entropy
function may still have a unique subgradient on the nonempty quasi-interior of
a positive cone. As an illustration, that work explicitly constructs a positive
cone with nonempty quasi-interior where the Hyvärinen scoring rule is the unique
0-homogeneous subgradient of its entropy function. Here, we discuss in greater
detail some of the basic properties of algebraic quasi-interior and generalise
the uniqueness result to an arbitrary convex domain.

2 The canonical extension 

The common application of unnormalised statistical models in the literature 
motivates us to consider the possible extensions of PSRs to positive cones.
The extension of the entropy function as a sublinear function to the positive 
cone of the set of probabilities is referred to as canonical. This extension is 
universal to all entropy functions and encapsulates the condition for propriety 
of a scoring rule directly, as we will see below. 

We begin with some standard definitions. We fix a measure space
$(\Omega, \mathcal{A}, \mu)$ and a convex class $\mathcal{P}$ of probability
distributions on $\Omega$ which are absolutely continuous with respect to the
measure $\mu$ and represented by their probability densities.

Definition 2.1. We call the functions $f : \Omega \to \mathbb{R}$
$\mathcal{P}$-integrable if

\[ \int_\Omega |f(x)| \, p(x) \, d\mu(x) < \infty \]

for every $p \in \mathcal{P}$. We denote by $\mathcal{L}(\mathcal{P})$ the
linear space of $\mathcal{P}$-integrable functions.

Formally, any mapping $S : \mathcal{P} \to \mathcal{L}(\mathcal{P})$ is a
scoring rule. Suppose that $X$ is a random variable taking values in $\Omega$
with unknown true distribution $p \in \mathcal{P}$. If $q \in \mathcal{P}$ is a
predictive density for $p$, then the random variable $S(q)(X)$ assigns a
numerical score to each outcome of $X$. The assumption of
$\mathcal{P}$-integrability ensures that $S(q)(X)$ has a finite expectation,

\[ p \cdot S(q) = \int_\Omega S(q)(x) \, p(x) \, d\mu(x), \]

which is also termed the expected score of $S$. Viewing scoring rules as
positive incentives which a forecaster wishes to maximise in the long run, only
those scoring rules which satisfy the following condition encourage honesty.

Definition 2.2. A scoring rule $S$ that maximises its expected score at the true
density,

\[ p \cdot S(p) = \max_{q \in \mathcal{P}} \, p \cdot S(q), \tag{1} \]

is called proper. If the true density is always the unique maximiser, $S$ is
called strictly proper.

Related concepts are the (negative) entropy,

\[ \Phi(p) = p \cdot S(p), \tag{2} \]

for every $p \in \mathcal{P}$, and the score divergence
$D : \mathcal{P} \times \mathcal{P} \to [0, \infty)$ given by

\[ D(p, q) = p \cdot S(p) - p \cdot S(q). \tag{3} \]

It follows immediately from Definition 2.2 that $\Phi$ is convex, being a
pointwise maximum of linear functions, and that $D$ is nonnegative. Strict
propriety is equivalent to $\Phi$ being strictly convex and to $D$ being
positive-definite, i.e. equal to zero only for $p = q$.
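As a finite-dimensional illustration of (1)-(3), the following minimal sketch takes a finite outcome space with counting measure, so that densities are probability vectors and $p \cdot S(q)$ is an ordinary dot product. The logarithmic scoring rule is used as the example, and all function names are illustrative assumptions rather than notation from the paper.

```python
import math
import random

def expected_score(p, score_q):
    """p . S(q): expectation of the score vector S(q) under the true density p."""
    return sum(pi * si for pi, si in zip(p, score_q))

def log_score(q):
    """Logarithmic scoring rule S(q)(x) = log q(x) on a finite outcome space."""
    return [math.log(qi) for qi in q]

def entropy(p, rule=log_score):
    """(Negative) entropy Phi(p) = p . S(p), as in (2)."""
    return expected_score(p, rule(p))

def divergence(p, q, rule=log_score):
    """Score divergence D(p, q) = p . S(p) - p . S(q), as in (3)."""
    return entropy(p, rule) - expected_score(p, rule(q))

def random_density(n, rng):
    """Strictly positive probability vector of length n (illustrative helper)."""
    w = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(w)
    return [wi / total for wi in w]
```

For this rule the score divergence is the Kullback-Leibler divergence, so nonnegativity of `divergence` and midpoint convexity of `entropy` can be checked numerically on random densities.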

Mathematically, propriety of a scoring rule is equivalent to convexity of the
associated entropy, which will be the key structural property we explore in what
follows. All subsequent results about PSRs and Bregman divergences will be
presented in the unified framework of the space $\operatorname{span}\mathcal{P}$
(the linear span of $\mathcal{P}$) and its dual $\mathcal{L}(\mathcal{P})$.
Notice that in finite dimensions $\operatorname{span}\mathcal{P}$ may be
identified with some Euclidean space $\mathbb{R}^n$, and since the latter is
self-dual, $\mathcal{L}(\mathcal{P})$ also identifies with $\mathbb{R}^n$. In
infinite dimensions, however, self-duality holds only in special cases, and in
general $\mathcal{L}(\mathcal{P})$ is more refined than the algebraic dual of
$\operatorname{span}\mathcal{P}$ but less refined than the topological dual of
$\operatorname{span}\mathcal{P}$, whenever the latter is equipped with a
topology. Consequently, the linear functionals in $\mathcal{L}(\mathcal{P})$ are
generally not continuous, and we will focus primarily on their algebraic
properties.

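Throughout, by $\mathcal{K}$ we denote a convex set such that
$\mathcal{P} \subseteq \mathcal{K} \subseteq \operatorname{span}\mathcal{P}$.
Therefore, the elements of $\mathcal{K}$ are linear combinations of probability
densities.

Definition 2.3. Given a function $\Phi : \mathcal{K} \to \mathbb{R}$ and a point
$q \in \mathcal{K}$, we say that $q^* \in \mathcal{L}(\mathcal{P})$ is a
($\mathcal{P}$-integrable) subgradient of $\Phi$ at $q$ relative to
$\mathcal{K}$ if

\[ \Phi(p) \ge (p - q) \cdot q^* + \Phi(q) \tag{4} \]

for all $p \in \mathcal{K}$. If the above inequality is strict for all
$p \ne q$, the subgradient $q^*$ is called strict.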

So, subgradients are linear functionals that define supporting hyperplanes
to the graph of a convex function. Specifically, the set

\[ \{ (p, y) \mid p \in \operatorname{span}\mathcal{P}, \; y = (p - q) \cdot q^* + \Phi(q) \} \]

is a supporting hyperplane to $\Phi$ at $q$. A convex function may have many
subgradients at a given point. The collection of all subgradients of $\Phi$ at
$q$ is called the subdifferential of $\Phi$ at $q$ and denoted by
$\partial\Phi(q)$. Suppose that $\partial\Phi(q) \ne \emptyset$ for each
$q \in \mathcal{K}$. Then, we call a selection of subgradients
$S(q) \in \partial\Phi(q)$, for each $q \in \mathcal{K}$, a subgradient of
$\Phi$ on $\mathcal{K}$.

Definition 2.3 implies the following characterisation of PSRs due to Gneiting 
and Raftery 2007. 

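Theorem 2.4. A scoring rule $S : \mathcal{P} \to \mathcal{L}(\mathcal{P})$ is
(strictly) proper if and only if there exists a pair $(\Phi, \Phi^*)$, where
$\Phi : \mathcal{P} \to \mathbb{R}$ is (strictly) convex and
$\Phi^* : \mathcal{P} \to \mathcal{L}(\mathcal{P})$ is a subgradient of $\Phi$
relative to $\mathcal{P}$, such that

\[ S(q)(x) = \Phi^*(q)(x) + \Phi(q) - q \cdot \Phi^*(q) \tag{5} \]

for every $q \in \mathcal{P}$.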

The merit of this result lies in the simplicity of its proof and the fact that it 
can be easily extended to arbitrary convex domains, as we will see in the next 
section. On the other hand, the theorem does not explain why only certain
subgradients of $\Phi$ may be identified with PSRs, which means that we are still
lacking a precise geometric interpretation of the condition of propriety. 
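The construction (5) can be sketched numerically on a finite outcome space. The helper below, whose name and signature are illustrative assumptions, takes an entropy and a subgradient selection and returns the scoring rule of Theorem 2.4; applied to the Shannon entropy $\sum_x q(x) \log q(x)$ with gradient $\log q + 1$, it recovers the logarithmic scoring rule on probability vectors.

```python
import math

def dot(u, v):
    """Pairing p . f on a finite outcome space with counting measure."""
    return sum(a * b for a, b in zip(u, v))

def score_from_entropy(phi, phi_star):
    """Build S(q) = Phi*(q) + Phi(q) - q . Phi*(q), as in (5).

    The scalar Phi(q) - q . Phi*(q) is added to the score of every outcome.
    """
    def rule(q):
        g = phi_star(q)
        shift = phi(q) - dot(q, g)
        return [gi + shift for gi in g]
    return rule

def shannon(q):
    """Shannon (negative) entropy on strictly positive vectors."""
    return sum(qi * math.log(qi) for qi in q)

def shannon_gradient(q):
    """Gradient of the Shannon entropy extended to the positive orthant."""
    return [math.log(qi) + 1.0 for qi in q]

log_rule = score_from_entropy(shannon, shannon_gradient)
```

The correction term equals $-1$ for every probability vector here, which is why the gradient $\log q + 1$ itself is not the PSR while $\log q$ is.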

Our next goal is to present such an interpretation by exploiting a beautiful
connection with Euler's homogeneous function theorem. To that end, let us first
review some properties related to homogeneity. For two sets $A$ and $B$ in
$\operatorname{span}\mathcal{P}$, we employ the Minkowski sum and difference
notation: $A \pm B = \{ a \pm b \mid a \in A, b \in B \}$. For
$\lambda \in \mathbb{R}$ and $A \subseteq \operatorname{span}\mathcal{P}$, we
write $\lambda A = \{ \lambda a \mid a \in A \}$. A set
$C \subseteq \operatorname{span}\mathcal{P}$ is called a convex cone if
$\lambda C = C$ and $C + C = C$ for all $\lambda > 0$. Throughout, we take the
conical hull of a set $C$, denoted $\operatorname{cone} C$, to mean the smallest
convex cone that contains $C$. Let a function $f : C \to \mathbb{R}$ be given,
where $C$ is a convex cone. It is said that $f$ is $\alpha$-homogeneous for some
$\alpha \in \mathbb{R}$ if $f(\lambda q) = \lambda^\alpha f(q)$ for every
$q \in C$ and every $\lambda > 0$. Notice that a convex, 1-homogeneous function
is a sublinear function. An extended version of Euler's homogeneous function
theorem states that if $\Phi : C \to \mathbb{R}$ is 1-homogeneous, then

\[ q \cdot \partial\Phi(q) = \Phi(q) \tag{6} \]

for every $q \in C$ (Hendrickson and Buehler 1971; Ovcharov 2014). The above
identity relates sets, since $\partial\Phi(q)$ is generally a multi-valued map.
It can be shown further that the subdifferential is a 0-homogeneous multi-valued
map in the sense that it satisfies the relation
$\partial\Phi(\lambda q) = \partial\Phi(q)$, for every $\lambda > 0$ and every
$q \in C$.

In view of the above, the extension of a PSR and its entropy as a
0-homogeneous and 1-homogeneous function, respectively, to
$\operatorname{cone}\mathcal{P} = \{ \lambda p \mid \lambda > 0, p \in \mathcal{P} \}$
behaves consistently. Explicitly, given
$S : \mathcal{P} \to \mathcal{L}(\mathcal{P})$ we set

\[ S(q) = S\Big( \frac{q}{q \cdot 1} \Big) \]

for every $q \in \operatorname{cone}\mathcal{P}$, where $q \cdot 1$ is the
normalising constant of $q$. Similarly, for $\Phi : \mathcal{P} \to \mathbb{R}$,
we write

\[ \Phi(q) = (q \cdot 1) \, \Phi\Big( \frac{q}{q \cdot 1} \Big), \]

for every $q \in \operatorname{cone}\mathcal{P}$. Due to (6), in the context of
1-homogeneous functions, Definition 2.3 reduces to the following.

Definition 2.5. Given a 1-homogeneous function
$\Phi : \operatorname{cone}\mathcal{P} \to \mathbb{R}$ and a point
$q \in \operatorname{cone}\mathcal{P}$, we say that
$q^* \in \mathcal{L}(\mathcal{P})$ is a ($\mathcal{P}$-integrable) subgradient
of $\Phi$ at $q$ relative to $\operatorname{cone}\mathcal{P}$ if

\[ \Phi(p) \ge p \cdot q^* \tag{7} \]

for all $p \in \operatorname{cone}\mathcal{P}$, with equality for $p = q$. If
the above inequality is strict for all $p$ not positively collinear to $q$, the
subgradient $q^*$ is called strict.

Notice the special convention for a strict subgradient on
$\operatorname{cone}\mathcal{P}$ in the above definition. This notion of
subgradient is coherent with the condition for propriety, which follows from the
formal equivalence of Definition 2.2 and Definition 2.5. Thus we arrive at the
classical characterisation of PSRs due to McCarthy 1956 and Hendrickson and
Buehler 1971. The formulation we give below emphasises the geometric nature of
the result.

Theorem 2.6. Let $S : \mathcal{P} \to \mathcal{L}(\mathcal{P})$ be a scoring
rule and $\Phi : \mathcal{P} \to \mathbb{R}$ be defined as
$\Phi(p) = p \cdot S(p)$, for every $p \in \mathcal{P}$. Then $S$ is (strictly)
proper if and only if the 0-homogeneous extension of $S$ to
$\operatorname{cone}\mathcal{P}$ is a (strict) subgradient of the 1-homogeneous
extension of $\Phi$ to $\operatorname{cone}\mathcal{P}$.
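The canonical extension and the subgradient condition (7) can be checked numerically in finite dimensions. The sketch below assumes a finite outcome space with counting measure, so that $q \cdot 1$ is the sum of the entries of $q$; the helper names are illustrative, and the logarithmic scoring rule with the Shannon entropy serves as the example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalise(q):
    """Map q in cone P to the density q / (q . 1)."""
    mass = sum(q)
    return [qi / mass for qi in q]

def extend_score(rule):
    """0-homogeneous extension: S(q) = S(q / (q . 1))."""
    return lambda q: rule(normalise(q))

def extend_entropy(phi):
    """1-homogeneous extension: Phi(q) = (q . 1) Phi(q / (q . 1))."""
    return lambda q: sum(q) * phi(normalise(q))

def log_score(q):
    return [math.log(qi) for qi in q]

def shannon(q):
    return sum(qi * math.log(qi) for qi in q)

S = extend_score(log_score)      # extended scoring rule on the positive orthant
Phi = extend_entropy(shannon)    # extended (sublinear) entropy
```

With these extensions, the Euler identity (6) reads `dot(q, S(q)) == Phi(q)` up to rounding, and `Phi(p) >= dot(p, S(q))` holds for unnormalised positive vectors `p` and `q`, with equality when `p` is a positive multiple of `q`.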

See also Williamson 2014, who characterises PSRs by making use of the
duality theory of convex functions. We now proceed to compare the two notions
of subgradient employed in Theorem 2.4 and Theorem 2.6, respectively. We first
would like to show that if $\Phi^* : \mathcal{P} \to \mathcal{L}(\mathcal{P})$
is a subgradient of a convex function $\Phi$ on $\mathcal{P}$, then $S$ in
Theorem 2.4 extends to a subgradient of $\Phi$ on the positive cone of
$\mathcal{P}$.

Corollary 2.7. Consider a (strictly) convex function
$\Phi : \mathcal{P} \to \mathbb{R}$ that has a subgradient
$\Phi^* : \mathcal{P} \to \mathcal{L}(\mathcal{P})$ on $\mathcal{P}$. Then

\[ S(q)(x) = \Phi^*(q)(x) + \Phi(q) - q \cdot \Phi^*(q) \]

is also a (strict) subgradient of $\Phi$ on $\mathcal{P}$. Moreover, the
0-homogeneous extension of $S$ is a (strict) subgradient of the 1-homogeneous
extension of $\Phi$ on $\operatorname{cone}\mathcal{P}$.

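Proof. The proof follows immediately from Theorem 2.6 and the fact that $S$ is
a PSR due to Theorem 2.4. However, it is instructive to show the claim
independently. The 0-homogeneous extension of $S$ is given by

\[ S(q) = \Phi^*\Big( \frac{q}{q \cdot 1} \Big) + \Phi\Big( \frac{q}{q \cdot 1} \Big) - \frac{q}{q \cdot 1} \cdot \Phi^*\Big( \frac{q}{q \cdot 1} \Big). \tag{8} \]

Clearly,

\[ q \cdot S(q) = (q \cdot 1) \, \Phi\Big( \frac{q}{q \cdot 1} \Big) \]

for any $q \in \operatorname{cone}\mathcal{P}$, as desired. We also have

\[ p \cdot S(q) = p \cdot \Phi^*\Big( \frac{q}{q \cdot 1} \Big) + (p \cdot 1) \Big( \Phi\Big( \frac{q}{q \cdot 1} \Big) - \frac{q}{q \cdot 1} \cdot \Phi^*\Big( \frac{q}{q \cdot 1} \Big) \Big) \]

for any $p, q \in \operatorname{cone}\mathcal{P}$. Factoring out $p \cdot 1$
and applying the subgradient inequality (4) at the normalised densities shows
that the right-hand side is at most
$(p \cdot 1) \, \Phi\big( p / (p \cdot 1) \big)$, as desired. □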


Another useful consequence of the above characterisations is the following. 

Corollary 2.8. Let $\Phi : \mathcal{P} \to \mathbb{R}$ be a (strictly) convex
function that has a subgradient
$\Phi^* : \mathcal{P} \to \mathcal{L}(\mathcal{P})$ on $\mathcal{P}$. Then
$\Phi^*$ is a (strictly proper) PSR associated with $\Phi$ if and only if
$q \cdot \Phi^*(q) = \Phi(q)$, for every $q \in \mathcal{P}$.

Proof. The proof follows directly from the hypothesis and (5). □


In the following example, we illustrate how the two theorems may be applied 
effectively to generate PSRs from convex functions. The reader may compare 
our methods of deriving PSRs with those of Dawid 2007. 

Example 2.9. Let $\mathcal{P}$ denote the set of probability densities in the
Lebesgue space $L^2(\Omega, \mu)$. We consider the quadratic entropy
$\Phi(q) = q \cdot q$ on $\mathcal{P}$ and wish to find a PSR associated with
$\Phi$. It is sufficient to find any subgradient of $\Phi$ on $\mathcal{P}$. The
easiest way of doing so is to extend $\Phi$ as the quadratic function on
$\operatorname{span}\mathcal{P} = L^2(\Omega, \mu)$ and make use of the fact
that the extended entropy is differentiable. Its functional derivative is given
by

\[ \Phi^*(q) = 2q, \]

which is also a subgradient of $\Phi$ on $L^2(\Omega, \mu)$, and in particular
on $\mathcal{P}$. However, $\Phi^*$ is not a PSR associated with $\Phi$, as
$q \cdot \Phi^*(q) \ne \Phi(q)$. For that reason, we apply Theorem 2.4 to find
that

\[ S(q) = \Phi^*(q) + \Phi(q) - q \cdot \Phi^*(q) = 2q - q \cdot q \]

is a PSR associated with $\Phi$. This scoring rule is known as the quadratic
scoring rule.

On the other hand, let us next consider the spherical entropy on $\mathcal{P}$,
defined as $\Phi(q) = (q \cdot q)^{1/2}$, and also find a PSR associated with
it. Notice now that $\Phi$ has a natural extension to
$\operatorname{span}\mathcal{P}$ as the $L^2$-norm, which is a sublinear
function. Using the fact that $\Phi$ is a composition of the functions
$x \mapsto \sqrt{x}$ and $q \mapsto q \cdot q$, we find that its functional
derivative on $\operatorname{span}\mathcal{P}$ is given by

\[ \Phi^*(q) = \frac{q}{(q \cdot q)^{1/2}}. \]

In the light of either Theorem 2.6 or Corollary 2.8, $\Phi^*$ is a PSR
associated with $\Phi$. This scoring rule is known as the spherical scoring
rule.
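Both scoring rules of Example 2.9 are easy to test on a finite outcome space, where the $L^2$ pairing becomes a dot product. The sketch below is an illustration under that assumption; the helper `is_proper_on` is a hypothetical name that checks the propriety inequality only on a finite sample of densities, so it is a sanity check and not a proof.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def quadratic_score(q):
    """S(q) = 2q - q . q (the constant q . q is subtracted at every outcome)."""
    qq = dot(q, q)
    return [2.0 * qi - qq for qi in q]

def spherical_score(q):
    """S(q) = q / (q . q)^(1/2)."""
    norm = math.sqrt(dot(q, q))
    return [qi / norm for qi in q]

def is_proper_on(rule, densities, tol=1e-12):
    """Check p . S(q) <= p . S(p) for every pair drawn from `densities`."""
    return all(dot(p, rule(q)) <= dot(p, rule(p)) + tol
               for p in densities for q in densities)

def random_density(n, rng):
    """Random probability vector of length n (illustrative helper)."""
    w = [rng.random() for _ in range(n)]
    total = sum(w)
    return [wi / total for wi in w]
```

The same setup shows why $\Phi^*(q) = 2q$ is not itself a PSR for the quadratic entropy: `dot(q, [2 * x for x in q])` equals twice `dot(q, q)`, so the condition of Corollary 2.8 fails.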





3 General convex extensions 

In certain situations, we need to consider more general convex extensions of the
entropy function to and beyond the positive cone. For example, as we saw in
Example 2.9, the simplest convex extension of quadratic entropy to the whole
space is the quadratic function $\Phi(q) = q \cdot q$, while the 1-homogeneous
extension of $\Phi$ to $\operatorname{cone}\mathcal{P}$,
$\Phi(q) = q \cdot q / (q \cdot 1)$, cannot be defined for signed densities for
which $q \cdot 1 = 0$.

We recall that by $\mathcal{K}$ we denote a convex set such that
$\mathcal{P} \subseteq \mathcal{K} \subseteq \operatorname{span}\mathcal{P}$.

Definition 3.1. Suppose that $\Phi : \mathcal{K} \to \mathbb{R}$ has a
subgradient $\Phi^* : \mathcal{K} \to \mathcal{L}(\mathcal{P})$ on
$\mathcal{K}$. The functional Bregman divergence on $\mathcal{K}$ associated
with the pair $(\Phi, \Phi^*)$ is the function
$D_{(\Phi, \Phi^*)} : \mathcal{K} \times \mathcal{K} \to \mathbb{R}$ given by

\[ D_{(\Phi, \Phi^*)}(p, q) = \Phi(p) - (p - q) \cdot \Phi^*(q) - \Phi(q), \tag{9} \]

for all $p, q \in \mathcal{K}$.

We note that $D_{(\Phi, \Phi^*)}$ is always nonnegative, while
$D_{(\Phi, \Phi^*)}$ is positive-definite if and only if $\Phi$ is strictly
convex. Notice that if

\[ S(q)(x) = \Phi^*(q)(x) + \Phi(q) - q \cdot \Phi^*(q) \]

is a PSR on $\mathcal{P}$, then

\[ p \cdot S(p) - p \cdot S(q) = D_{(\Phi, \Phi^*)}(p, q) \]

is a Bregman divergence on $\mathcal{P}$. Hence, score divergences are Bregman
divergences for probability densities.
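A minimal finite-dimensional sketch of (9) follows, assuming a finite outcome space with counting measure and using the quadratic entropy as the example; the factory `bregman` and the other names are illustrative. It confirms numerically that the score divergence of the quadratic scoring rule coincides with the Bregman divergence of the pair $(\Phi, \Phi^*) = (q \cdot q, 2q)$, which is the squared Euclidean distance.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def bregman(phi, phi_star):
    """Return D(p, q) = Phi(p) - (p - q) . Phi*(q) - Phi(q), as in (9)."""
    def divergence(p, q):
        diff = [a - b for a, b in zip(p, q)]
        return phi(p) - dot(diff, phi_star(q)) - phi(q)
    return divergence

def quadratic(q):
    """Quadratic entropy Phi(q) = q . q."""
    return dot(q, q)

def quadratic_gradient(q):
    """Gradient Phi*(q) = 2q of the quadratic entropy."""
    return [2.0 * qi for qi in q]

def quadratic_score(q):
    """Quadratic scoring rule S(q) = 2q - q . q."""
    qq = dot(q, q)
    return [2.0 * qi - qq for qi in q]

D = bregman(quadratic, quadratic_gradient)
```

Unlike the score divergence, `D` is also defined for signed vectors outside the simplex, which is the point of working on the larger domain $\mathcal{K}$.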

On the extended domain $\mathcal{K}$, the Bregman divergence is defined as the
vertical distance between $\Phi$ and the supporting hyperplanes to $\Phi$
generated by $\Phi^*$. Consider the function
$s : \mathcal{K} \times \mathcal{K} \to \mathbb{R}$ given by

\[ s(p, q) = (p - q) \cdot \Phi^*(q) + \Phi(q), \]

for all $p, q \in \mathcal{K}$, which allows us to write (9) simply as

\[ D_{(\Phi, \Phi^*)}(p, q) = s(p, p) - s(p, q) \]

for all $p, q \in \mathcal{K}$. Notice that for each $q \in \mathcal{K}$,
$s(\cdot, q)$ is an affine functional on $\operatorname{span}\mathcal{P}$.

In order to present the following definition, we denote by
$\mathcal{A}(\mathcal{P})$ the vector space of affine functionals $A$ on
$\operatorname{span}\mathcal{P}$ of the form $A(p) = p \cdot f + a$, where
$f \in \mathcal{L}(\mathcal{P})$ and $a \in \mathbb{R}$ is a constant.

Definition 3.2. Any mapping $S : \mathcal{K} \to \mathcal{A}(\mathcal{P})$ is
said to be an affine scoring rule on $\mathcal{K}$. The associated function
$s : \operatorname{span}\mathcal{P} \times \mathcal{K} \to \mathbb{R}$, defined
as $s(p, q) = S(q)(p)$, is the score function of $S$. The rule $S$ is said to be
(strictly) proper if its score function $s$ (strictly) satisfies

\[ s(p, q) \le s(p, p) \]

for all $p, q \in \mathcal{K}$.

The following characterisation of proper affine scoring rules is due to
Frongillo and Kash 2014, who refer to affine scoring rules as affine scores.

Theorem 3.3. An affine scoring rule
$S : \mathcal{K} \to \mathcal{A}(\mathcal{P})$ is (strictly) proper if and only
if there is a (strictly) convex function $\Phi : \mathcal{K} \to \mathbb{R}$ and
a subgradient $\Phi^* : \mathcal{K} \to \mathcal{L}(\mathcal{P})$ of $\Phi$ on
$\mathcal{K}$ such that

\[ s(p, q) = (p - q) \cdot \Phi^*(q) + \Phi(q) \tag{10} \]

for all $p, q \in \mathcal{K}$.

Let us now describe the important special case where an affine scoring rule
is in fact linear and may be identified with a family of subgradients of a convex
function. To that end, let $C$ denote a convex cone such that
$\mathcal{P} \subseteq C \subseteq \operatorname{span}\mathcal{P}$.

Corollary 3.4. Let $S : C \to \mathcal{A}(\mathcal{P})$ be a proper affine
scoring rule and let $\Phi : C \to \mathbb{R}$, $\Phi(p) = s(p, p)$, be the
associated extended entropy. Then, $S$ is a linear map if and only if $\Phi$ is
1-homogeneous on $C$.
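The dichotomy between affine and linear score functions can be seen in a small finite-dimensional sketch, assuming vectors in $\mathbb{R}^3$ in place of densities; all names are illustrative. The score function (10) built from the quadratic function carries a constant term, whereas the one built from the Euclidean norm, which is 1-homogeneous, is linear in $p$.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def score_function(phi, phi_star):
    """s(p, q) = (p - q) . Phi*(q) + Phi(q), as in (10)."""
    def s(p, q):
        return dot([a - b for a, b in zip(p, q)], phi_star(q)) + phi(q)
    return s

def quadratic(q):
    return dot(q, q)

def quadratic_gradient(q):
    return [2.0 * qi for qi in q]

def norm(q):
    """Euclidean norm: a convex, 1-homogeneous (sublinear) function."""
    return math.sqrt(dot(q, q))

def norm_gradient(q):
    """Gradient of the norm away from the origin."""
    n = norm(q)
    return [qi / n for qi in q]

s_quad = score_function(quadratic, quadratic_gradient)   # affine in p
s_norm = score_function(norm, norm_gradient)              # linear in p
```

Evaluating at `p = [0, 0, 0]` isolates the constant term of $s(\cdot, q)$: it vanishes for `s_norm` and equals $-q \cdot q$ for `s_quad`.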

To summarise, in this section we have considered convex extensions of the
entropy function outside the set of probabilities $\mathcal{P}$. Any family of supporting
hyperplanes to an extended entropy function defines a proper score function, 
which generalises the expected score of a PSR. The construction also induces 
the class of functional Bregman divergences. 

4 Properties of functional Bregman divergences 

Here we generalise some basic properties of Bregman divergences to the
functional setting. In our first result we characterise functional Bregman
divergences under the notion of subgradient that is natural for PSRs. The result
extends a similar claim in Banerjee et al. 2005, Appendix A from the Euclidean
setting. In this section again $\mathcal{K}$ denotes a convex set such that
$\mathcal{P} \subseteq \mathcal{K} \subseteq \operatorname{span}\mathcal{P}$.

Theorem 4.1. Let $D : \mathcal{K} \times \mathcal{K} \to [0, \infty)$ be a
divergence on $\mathcal{K}$. Then $D$ is a functional Bregman divergence on
$\mathcal{K}$ if and only if for any $a \in \mathcal{K}$ the function
$\Phi(p) = D(p, a)$ is (strictly) convex and $\Phi$ has a subgradient
$\Phi^* : \mathcal{K} \to \mathcal{L}(\mathcal{P})$ such that

\[ D(p, q) = D_{(\Phi, \Phi^*)}(p, q) \]

for all $p, q \in \mathcal{K}$.

Proof. Suppose that $D$ is a functional Bregman divergence associated with some
pair $(\Phi_1, \Phi_1^*)$. Then, the function

\[ \Phi(p) = \Phi_1(p) - p \cdot \Phi_1^*(a) + a \cdot \Phi_1^*(a) - \Phi_1(a) \]

is (strictly) convex. Since $\Phi(p)$ and $\Phi_1(p)$ only differ by an element
in $\mathcal{A}(\mathcal{P})$, they generate the same Bregman divergence. The
sufficiency part is trivial. □

A divergence function $D$ on $\mathcal{K}$ is said to be symmetric whenever
$D(p, q) = D(q, p)$ for all $p, q \in \mathcal{K}$. Bauschke and Borwein 2001
and Boissonnat, Nielsen, and Nock 2010 study the symmetric Bregman divergences
on the real line and on Euclidean spaces, respectively. The former authors show
that the generalised (or weighted) quadratic divergence is the only symmetric
Bregman divergence on the real line. We note that the proof easily extends to
separable Bregman divergences. Let us recall that a functional Bregman
divergence $D : \mathcal{K} \times \mathcal{K} \to \mathbb{R}$ is separable if
$D$ is in the form

\[ D(p, q) = \int_\Omega D_f(p(x), q(x)) \, d\nu(x) \]

for any $p, q \in \mathcal{K}$, where $D_f$ is a Bregman divergence on the real
line induced by some convex differentiable function
$f : \mathbb{R} \to \mathbb{R}$, and $\nu$ is a measure on $\Omega$ that is
absolutely continuous with respect to $\mu$. In what follows, we present a
generalisation of that proof to the context of a large class of non-separable
Bregman divergences.

To that end, let $\Phi : \mathcal{K} \to \mathbb{R}$ be a convex function of
the form

\[ \Phi(p) = \phi\Big( \int_\Omega f(p(x)) \, d\nu(x) \Big), \tag{11} \]

where $f$ and $\nu$ are as above, while $\phi : \mathbb{R} \to \mathbb{R}$ is an
increasing function. This family includes, for example, the pseudospherical
scoring rules, whose divergences are evidently non-separable. When $\phi$ is the
identity, we recover the class of entropy functions that generate all separable
Bregman divergences. Suppose that $\operatorname{span}\mathcal{P}$ may be
identified with a Frechet space $\mathcal{N}$, and let $\mathcal{K}$ be an open
convex set in $\mathcal{N}$ containing $\mathcal{P}$. We denote by
$\mathcal{N}^*$ the topological dual space of $\mathcal{N}$, which we assume to
be identifiable with a subspace of $\mathcal{L}(\mathcal{P})$.

Theorem 4.2. Let $\Phi : \mathcal{K} \to \mathbb{R}$ be a strictly convex
function of the form (11). Suppose also that $\phi$ and $f$ are twice
differentiable and $\Phi$ is twice Frechet differentiable. If the associated
functional Bregman divergence is symmetric, then $\Phi$ has the form




or 


up to affine terms $\alpha \int_\Omega q \, d\nu + \beta$, where $\alpha$ and $\beta$ are real constants.

The proof is relegated to the Appendix. In view of the theorem, the only 
symmetric functional Bregman divergences on $\mathcal{K}$ induced by convex functions
$\Phi$ in the form (11) are the following:


Notice that by the Cauchy-Schwarz inequality,



hence the quadratic divergence $D_2$ has greater discriminatory power than
$D_1$. The two divergences may be regarded as members of the class of
generalised quadratic divergences, but we will not try to formalise the latter
notion in infinite dimensions. Our negative result means that, within the class
of entropies of the form (11), all Bregman divergences other than the
generalised quadratic divergences are nonsymmetric.

For completeness, we note that Boissonnat, Nielsen, and Nock 2010 show
that if $Q$ is a positive-definite matrix of dimension $n$, then the generalised
quadratic divergence,

\[ D(p, q) = (p - q)^T Q \, (p - q), \]

closely related to Mahalanobis distance, is the only symmetric Bregman
divergence on $\mathbb{R}^n$. Notice that the latter divergence is separable if
and only if $Q$ is diagonal. It would be of interest to extend their method of
proof to the functional setting, which will likely offer a more general result
than Theorem 4.2.
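The contrast between symmetric and nonsymmetric Bregman divergences is easy to observe in $\mathbb{R}^2$. The sketch below, with illustrative function names and an arbitrarily chosen positive-definite matrix, compares the generalised quadratic divergence with the Bregman divergence generated by the Shannon entropy $\sum_x p(x) \log p(x)$ on positive vectors, which is the (generalised) Kullback-Leibler divergence.

```python
import math

def quadratic_divergence(p, q, Q):
    """Generalised quadratic divergence (p - q)^T Q (p - q) for a square matrix Q."""
    d = [a - b for a, b in zip(p, q)]
    n = len(d)
    return sum(d[i] * Q[i][j] * d[j] for i in range(n) for j in range(n))

def kl_divergence(p, q):
    """Bregman divergence of the Shannon entropy on strictly positive vectors."""
    return sum(a * math.log(a / b) - a + b for a, b in zip(p, q))
```

Swapping the arguments leaves `quadratic_divergence` unchanged, because only the difference `p - q` enters, while `kl_divergence` generally changes.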


5 Differentiability properties of entropy functions


It is well known that in finite dimensions any convex function on an open domain
is differentiable and has a unique subgradient everywhere except on a set of
Lebesgue measure zero (Rockafellar 1972). This implies that an entropy function
in finite dimensions determines a unique PSR up to a negligible set. Direct
generalisation of this result to infinite dimensions is difficult as there is no
analogue of the Lebesgue measure in that setting. Instead, in what follows we
describe the subdifferentials of entropy functions and provide sufficient
conditions for a unique subgradient.

We begin with the case where the extended entropy is a differentiable function
with respect to the Gateaux derivative, which we review next. To that end, let
us suppose that $\operatorname{span}\mathcal{P}$ may be identified with a normed
space $(\mathcal{N}, \|\cdot\|)$ and denote by $\mathcal{N}^*$ the topological
dual space of $\mathcal{N}$. Furthermore, let also $\mathcal{N}^*$ be
identifiable with a subspace of $\mathcal{L}(\mathcal{P})$. As usual, the set
$\mathcal{K}$ is convex and
$\mathcal{P} \subseteq \mathcal{K} \subseteq \operatorname{span}\mathcal{P}$.


Definition 5.1. Suppose that the set $\mathcal{K}$ is open with respect to the
topology of $\mathcal{N}$. A function $\Phi : \mathcal{K} \to \mathbb{R}$ is
Gateaux differentiable at a point $q \in \mathcal{K}$ if there is
$q^* \in \mathcal{N}^*$ such that for every $p \in \mathcal{N}$, the limit

\[ p \cdot q^* = \lim_{t \to 0} \frac{\Phi(q + tp) - \Phi(q)}{t} \]

exists. The functional $q^*$ is called the Gateaux derivative of $\Phi$ at $q$.


The Gateaux derivative is necessarily unique by definition. We say that
$\Phi$ is differentiable on $\mathcal{N}$ if $\Phi$ is differentiable at every
point in $\mathcal{N}$. The Gateaux derivative has a natural geometric
interpretation in the context of convex functions, as shown by the following
standard result from convex analysis (Aragon Artacho et al. 2014; Borwein and
Vanderwerff 2010; Zalinescu 2002).

Theorem 5.2. Suppose that the set $\mathcal{K}$ is open with respect to the
topology of $\mathcal{N}$, and let $\Phi : \mathcal{K} \to \mathbb{R}$ be a
convex and continuous function. Then, $\Phi$ is Gateaux differentiable on
$\mathcal{K}$ if and only if $\Phi$ admits a unique subgradient
$\Phi^* : \mathcal{K} \to \mathcal{N}^*$ at each point in $\mathcal{K}$. In this
case $\Phi^*$ is the Gateaux derivative of $\Phi$ on $\mathcal{K}$.

In the light of the theorem, every convex differentiable function
$\Phi : \mathcal{K} \to \mathbb{R}$ with gradient
$\Phi^* : \mathcal{K} \to \mathcal{N}^*$ defines a unique collection of
supporting hyperplanes to its graph. The restriction to probabilities of these
hyperplanes defines the expected score of a unique PSR. We illustrate the
theorem with our next example. See also Dawid 2007, Section 5 for comparison.

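Example 5.3. Let $\mathcal{P}$ be the set of all probability densities in the
Lebesgue space $\mathcal{N} = L^\gamma(\Omega, \mu)$, for
$1 < \gamma < \infty$, and consider the power entropy function,

\[ \Phi_\gamma(p) = \int_\Omega p^\gamma(x) \, d\mu(x), \]

for $p \in L^\gamma(\Omega, \mu)$. We have that
$\operatorname{span}\mathcal{P} = \mathcal{N}$ and the topological dual space of
$L^\gamma(\Omega, \mu)$ is
$\mathcal{N}^* = L^{\gamma/(\gamma-1)}(\Omega, \mu)$. Clearly, $\mathcal{N}^*$
may be identified with a subspace of $\mathcal{L}(\mathcal{P})$.

We proceed to compute the Gateaux derivative of $\Phi_\gamma$ on
$L^\gamma(\Omega, \mu)$. We have

\[ \lim_{t \to 0} \frac{\Phi_\gamma(q + tp) - \Phi_\gamma(q)}{t} = \frac{d}{dt} \Big[ (q + tp)^\gamma \cdot 1 \Big] \Big|_{t = 0} = p \cdot \gamma q^{\gamma - 1}. \]

Since $\gamma q^{\gamma - 1} \in \mathcal{N}^*$,
$\Phi_\gamma^*(q) = \gamma q^{\gamma - 1}$ is indeed the Gateaux derivative of
$\Phi_\gamma$. Thus, the Bregman divergence on $L^\gamma(\Omega, \mu)$
associated with $\Phi_\gamma$ is

\[ D_\gamma(p, q) = p \cdot p^{\gamma - 1} - (p - q) \cdot \gamma q^{\gamma - 1} - q \cdot q^{\gamma - 1}. \]

The associated score function is

\[ s_\gamma(p, q) = (p - q) \cdot \gamma q^{\gamma - 1} + q \cdot q^{\gamma - 1}, \]

and $\Phi_\gamma(p) = s_\gamma(p, p)$. The restriction of $s_\gamma$ to
$\mathcal{P}$ yields the PSR

\[ S_\gamma(q) = \gamma q^{\gamma - 1} - (\gamma - 1) \, q \cdot q^{\gamma - 1}, \]

the power scoring rule with exponent $\gamma$.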
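The computation of Example 5.3 can be sanity-checked on a finite outcome space with counting measure, where the integral becomes a sum. The sketch below uses illustrative function names; it compares a finite difference quotient with the candidate derivative $\gamma q^{\gamma-1}$ and evaluates the power scoring rule for positive vectors.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def power_entropy(q, gamma):
    """Phi_gamma(q) = sum of q^gamma (counting measure on a finite set)."""
    return sum(qi ** gamma for qi in q)

def power_gradient(q, gamma):
    """Candidate Gateaux derivative gamma * q^(gamma - 1)."""
    return [gamma * qi ** (gamma - 1.0) for qi in q]

def power_score(q, gamma):
    """S_gamma(q) = gamma q^(gamma-1) - (gamma - 1) q . q^(gamma-1)."""
    # q . q^(gamma-1) equals Phi_gamma(q), subtracted at every outcome.
    shift = (gamma - 1.0) * power_entropy(q, gamma)
    return [gi - shift for gi in power_gradient(q, gamma)]

def difference_quotient(q, p, gamma, t):
    """(Phi(q + t p) - Phi(q)) / t for a small step t."""
    moved = [qi + t * pi for qi, pi in zip(q, p)]
    return (power_entropy(moved, gamma) - power_entropy(q, gamma)) / t
```

For `gamma = 2.0` the function `power_score` reduces to the quadratic scoring rule $2q - q \cdot q$ of Example 2.9.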

As it is well-known, the above assumptions do not apply to important en¬ 
tropies such as Shannon entropy and Hyvarinen entropy, which do not have 
functional derivatives. On the other hand, all entropy functions of practical 
interest have well-behaved directional derivatives. Before we recall the rele¬ 
vant definition, we note that in what follows we do not assume that spanP is 
equipped with topology, and hence spanP is a general vector space. As usual, 
the set 1C is convex and V C fC C span'P. 
Definition 5.4. The right directional derivative of $\Phi : \mathcal{K} \to \mathbb{R}$ at $q \in \mathcal{K}$ along
the vector $p \in \operatorname{cone}(\mathcal{K} - q)$ is defined as the limit

$$\Phi'_+(p, q) = \lim_{t \to 0^+} \frac{\Phi(q + tp) - \Phi(q)}{t} \tag{12}$$

whenever it exists.

Geometrically, the set $\operatorname{cone}(\mathcal{K} - q)$ gives all non-exterior directions to the set
$\mathcal{K}$ based at $q$. When $\Phi$ is convex, the above limit always exists in a generalised
sense that includes convergence to $-\infty$. The subdifferential of $\Phi$ is characterised
by the following result.
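A finite-dimensional analogue shows what convergence to $-\infty$ looks like in practice. The sketch below takes the convex function $\Phi(q) = \sum_i q_i \log q_i$ (the negative of Shannon entropy) on the nonnegative orthant of $\mathbb{R}^3$ and evaluates the difference quotient from (12) for shrinking $t$. The points `q_boundary`, `q_interior`, the direction `p` and the function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def neg_shannon(q):
    """Convex entropy Phi(q) = sum_i q_i log q_i, with the convention 0 log 0 = 0."""
    q = np.asarray(q, dtype=float)
    pos = q > 0
    return float(np.sum(q[pos] * np.log(q[pos])))

def quotient(q, p, t):
    """Difference quotient (Phi(q + t p) - Phi(q)) / t from the definition."""
    return (neg_shannon(q + t * p) - neg_shannon(q)) / t

q_boundary = np.array([0.0, 0.5, 0.5])   # first coordinate vanishes
q_interior = np.array([0.2, 0.3, 0.5])   # strictly positive
p = np.array([0.4, 0.3, 0.3])            # direction pointing into the orthant

steps = [10.0 ** (-k) for k in range(1, 9)]
boundary_quotients = [quotient(q_boundary, p, t) for t in steps]
interior_quotients = [quotient(q_interior, p, t) for t in steps]

# At a strictly positive point the limit is finite and equals p . (log q + 1).
exact_interior = float(np.sum(p * (np.log(q_interior) + 1.0)))
```

At the boundary point the quotients contain the term $p_1 \log(t p_1)$ and keep decreasing as $t$ shrinks, while at the strictly positive point they settle at the finite value `exact_interior`.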

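Theorem 5.5. Let $\Phi : \mathcal{K} \to \mathbb{R}$ be a convex function. Then $\Phi$ has a $\mathcal{P}$-integrable
subgradient at a point $q \in \mathcal{K}$ if and only if there is $q^* \in \mathcal{L}(\mathcal{P})$ such that

$$p \cdot q^* \le \Phi'_+(p, q)$$

for all $p \in \operatorname{cone}(\mathcal{K} - q)$.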

The proof is a minor variant of Ovcharov 2014, Theorem 3.1. 

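We next discuss the question of when the subdifferential of $\Phi$ at a given point
$q \in \mathcal{K}$ has a unique element in $\mathcal{L}(\mathcal{P})$. First, define the set

$$\mathcal{O}(q) = \operatorname{cone}(\mathcal{K} - q) \cap -\operatorname{cone}(\mathcal{K} - q),$$

which is a vector subspace of $\operatorname{span}\mathcal{P}$. On this subspace the right directional
derivative $\Phi'_+(\cdot, q)$ is always finite (Borwein and Vanderwerff 2010). Standardly,
if $\mathcal{O}(q) = \operatorname{span}\mathcal{P}$, then $q$ is an (algebraically) interior point of $\mathcal{K}$ (relative to
$\operatorname{span}\mathcal{P}$). If $\mathcal{K}$ has empty interior, however, that is, $\mathcal{O}(q) \neq \operatorname{span}\mathcal{P}$ for any $q \in \mathcal{K}$,
then we may refine the notion of interior by assuming that $\mathcal{O}(q)$ has in a certain
sense negligible complement in $\operatorname{span}\mathcal{P}$. We proceed to formalise that sense.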

To that end, let us recall that if $E$ is a subset of $\operatorname{span}\mathcal{P}$, the set of all
$f \in \mathcal{L}(\mathcal{P})$ such that

$$p \cdot f = 0,$$

for all $p \in E$, is the annihilator of $E$ in $\mathcal{L}(\mathcal{P})$. We denote this set by $E^\perp$.
Clearly, $E^\perp$ is a linear subspace of $\mathcal{L}(\mathcal{P})$. In the case where $E^\perp = \{0\}$, we say
that $E$ has trivial annihilator.

Definition 5.6. Any point $q \in \mathcal{K}$ such that $\mathcal{O}(q)$ has trivial annihilator in
$\mathcal{L}(\mathcal{P})$ is called an algebraically quasi-interior point of $\mathcal{K}$ relative to $\operatorname{span}\mathcal{P}$. The
collection of all algebraically quasi-interior points of $\mathcal{K}$ is the algebraic
quasi-interior of $\mathcal{K}$, denoted by $\operatorname{qint}\mathcal{K}$.

It is not hard to see that the algebraic quasi-interior of a convex set $\mathcal{K}$
coincides with the relative interior of $\mathcal{K}$ in finite dimensions. Similarly, if $\operatorname{span}\mathcal{P}$
coincides with a normed space $\mathcal{N}$ such that $\mathcal{N}^* = \mathcal{L}(\mathcal{P})$ and $\mathcal{K}$ has nonempty
topological interior, then the topological interior of $\mathcal{K}$ coincides with the
algebraic quasi-interior of $\mathcal{K}$. If the topological interior of $\mathcal{K}$ is empty, however,
its algebraic quasi-interior may still be nonempty, which reflects the fact that the
spaces $\mathcal{O}(q)$ do not have to coincide with the whole space $\operatorname{span}\mathcal{P}$, as long as
their complements are negligible in the precise sense of Definition 5.6. It may
also be shown by a standard argument that if $q_1 \in \operatorname{qint}\mathcal{K}$ and $q_2 \in \mathcal{K}$, then the
relative interior of the line segment $[q_1, q_2]$ lies in $\operatorname{qint}\mathcal{K}$. In particular, $\operatorname{qint}\mathcal{K}$
is convex. Finally, in Ovcharov 2014 we show that it is not hard to find convex
cones in $L^1(\mathbb{R}^n)$ with nonempty quasi-interior that are suitable domains for
standard entropy functions such as Shannon entropy and Hyvarinen entropy.
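In finite dimensions Definition 5.6 can be evaluated by plain linear algebra. The sketch below assumes that $\mathcal{K}$ is the nonnegative orthant of $\mathbb{R}^n$ (the cone over the probability simplex), so that $\operatorname{span}\mathcal{P} = \mathbb{R}^n$ and the pairing is the ordinary dot product. In that case a direction $p$ lies in $\mathcal{O}(q)$ exactly when $p_i = 0$ wherever $q_i = 0$. The function names are hypothetical.

```python
import numpy as np

def lineality_basis(q):
    """Basis of O(q) = cone(K - q) ∩ -cone(K - q) for K the nonnegative orthant:
    the unit vectors e_i with q_i > 0."""
    q = np.asarray(q, dtype=float)
    return np.eye(q.size)[q > 0]

def annihilator_dim(q):
    """Dimension of the annihilator of O(q) in R^n, namely n - dim O(q)."""
    q = np.asarray(q, dtype=float)
    basis = lineality_basis(q)
    rank = int(np.linalg.matrix_rank(basis)) if basis.size else 0
    return q.size - rank

def is_quasi_interior(q):
    """q is algebraically quasi-interior iff O(q) has trivial annihilator."""
    return annihilator_dim(q) == 0

positive_point = [0.2, 0.3, 0.5]   # in the relative interior of the orthant
boundary_point = [0.0, 0.5, 0.5]   # on a face of the orthant
vertex_point = [0.0, 0.0, 1.0]     # on an extreme ray
```

Under these assumptions the quasi-interior points are exactly the strictly positive vectors, which agrees with the relative interior of the orthant.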


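Theorem 5.7. Let $\Phi : \mathcal{K} \to \mathbb{R}$ be a convex function. If $q \in \operatorname{qint}\mathcal{K}$ and there is
$q^* \in \mathcal{L}(\mathcal{P})$ such that

$$p \cdot q^* = \Phi'_+(p, q) \tag{13}$$

for all $p \in \operatorname{cone}(\mathcal{K} - q)$, then $q^*$ is the unique $\mathcal{P}$-integrable subgradient of $\Phi$ at
$q$ relative to $\mathcal{K}$.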


If the assumptions of Theorem 5.7 hold for any $q \in \operatorname{qint}\mathcal{K}$, then the resulting
proper affine scoring rule is uniquely associated with $\Phi$ on $\operatorname{qint}\mathcal{K}$. The proof of
the theorem is similar to that of Ovcharov 2014, Theorem 3.2. See the examples
there, which show that the logarithmic and Hyvarinen scoring rules are the
unique 0-homogeneous $\mathcal{P}$-integrable subgradients of their entropy functions on
the (nonempty) quasi-interior of a suitably chosen positive cone. The theorem is
general enough to include all PSRs of probability densities that are of practical
interest.

A natural setting in which to apply the previous two results is the following one.
Assume that $\mathcal{K}$ is large enough so that

$$\mathcal{P} \subset \operatorname{cone}(\mathcal{K} - q) \tag{14}$$

for any $q \in \mathcal{P}$. Condition (14) states that any direction $p \in \mathcal{P}$ based at any
$q \in \mathcal{P}$ is non-exterior for the set $\mathcal{K}$ (that is, for some $\lambda > 0$, $\lambda p + q \in \mathcal{K}$).
For example, the choice $\mathcal{K} = \operatorname{cone}\mathcal{P}$ always satisfies condition (14). Due to
(14), $\Phi'_+(p, q)$ is well-defined for any $p, q \in \mathcal{P}$ (but may be equal to $-\infty$). If
additionally $\mathcal{P} \subset \operatorname{qint}(\mathcal{K})$ and $\Phi$ satisfies the assumptions of Theorem 5.7 for
any $q \in \mathcal{P}$, then $\Phi$ has a unique subgradient at any point in $\mathcal{P}$ relative to $\mathcal{K}$ in
the class $\mathcal{L}(\mathcal{P})$. The resulting proper scoring rule is uniquely associated with its
extended entropy with respect to the latter notion of subgradient.
Appendix

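Proof of Theorem ^.2. Let $\Phi' : \mathcal{N} \to \mathbb{R}$ and $\Phi'' : \mathcal{N} \times \mathcal{N} \to \mathbb{R}$ denote the first
and second Frechet derivatives of $\Phi$ on $\mathcal{K}$. A computation shows

$$\xi \cdot \Phi'(p) = \phi'\Big(\int_\Omega f(p(x))\, d\nu(x)\Big) \int_\Omega f'(p(x))\, \xi(x)\, d\nu(x),$$

$$(\xi, \eta) \cdot \Phi''(p) = \phi'\Big(\int_\Omega f(p(x))\, d\nu(x)\Big) \int_\Omega f''(p(x))\, \xi(x)\, \eta(x)\, d\nu(x)
  + \phi''\Big(\int_\Omega f(p(x))\, d\nu(x)\Big) \int_\Omega f'(p(x))\, \xi(x)\, d\nu(x) \int_\Omega f'(p(x))\, \eta(x)\, d\nu(x).$$

We remark that $\cdot$ denotes both the duality pairing with respect to $\mathcal{N}$ and $\mathcal{N}^*$,
and with respect to $\operatorname{span}\mathcal{P}$ and $\mathcal{L}(\mathcal{P})$. This is well-justified since $f'(p)\xi$ and
$f''(p)\xi^2$ must be in $\mathcal{L}(\mathcal{P})$ for all $p \in \mathcal{K}$ and all $\xi \in \operatorname{span}\mathcal{P}$ due to the hypothesis.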

Symmetry of the Bregman divergence associated with $\Phi$ means that we have
the identity

$$2\Phi(p) - (p - q) \cdot \Phi'(q) = 2\Phi(q) - (q - p) \cdot \Phi'(p).$$

Let $p_t$ denote $p + tr$, for $t \in [0, 1]$, $r \in \mathcal{N}$. Replace $p$ with $p_t$ above and
differentiate with respect to $t$ at $t = 0$ to find

$$2 r \cdot \Phi'(p) - r \cdot \Phi'(q) = r \cdot \Phi'(p) - ((q - p), r) \cdot \Phi''(p).$$

Since $r \in \mathcal{N}$ is arbitrary, we have

$$\Phi'(q) = (q, \cdot) \cdot \Phi''(p) + \Phi'(p) - (p, \cdot) \cdot \Phi''(p).$$

Fix $p$ and regard $q$ as the only variable above. Using the explicit form of
$\Phi''(p)$, we get that

$$\Phi'(q) = 2\alpha a q + 2\beta (q \cdot b) b + c,$$

where $\alpha, \beta \in \mathbb{R}$, and $a, b, c : \Omega \to \mathbb{R}$. In view of the fundamental theorem of
calculus for Frechet spaces (Hamilton 1982, Theorem 3.2.2),

$$\Phi(q) = \alpha\, q \cdot a q + \beta (q \cdot b)^2 + q \cdot c + \gamma,$$

where $\gamma$ is a constant of integration. (The latter claim may be verified directly
by differentiation.) Since $\Phi$ must be in the form (11), the claim follows. □
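The quadratic form obtained in the proof can be sanity-checked on a finite measure space. The sketch below verifies, for one hypothetical choice of weights `w`, coefficients `alpha`, `beta`, `const` and functions `a`, `b`, `c`, that the Bregman divergence of the quadratic entropy is symmetric. For contrast it also evaluates the divergence of the cubic power entropy, which is not of this form. All concrete numbers and names are assumptions made for illustration.

```python
import numpy as np

def pair(p, f, w):
    """Duality pairing p . f as a weighted sum over a finite measure space."""
    return float(np.sum(p * f * w))

def quadratic_entropy(q, w, alpha, beta, a, b, c, const):
    """Phi(q) = alpha q . (a q) + beta (q . b)^2 + q . c + const."""
    return (alpha * pair(q, a * q, w) + beta * pair(q, b, w) ** 2
            + pair(q, c, w) + const)

def quadratic_gradient(q, w, alpha, beta, a, b, c, const):
    """Phi'(q) = 2 alpha a q + 2 beta (q . b) b + c (const does not contribute)."""
    return 2.0 * alpha * a * q + 2.0 * beta * pair(q, b, w) * b + c

def quadratic_bregman(p, q, w, **params):
    """D(p, q) = Phi(p) - Phi(q) - (p - q) . Phi'(q) for the quadratic entropy."""
    return (quadratic_entropy(p, w, **params) - quadratic_entropy(q, w, **params)
            - pair(p - q, quadratic_gradient(q, w, **params), w))

def cubic_bregman(p, q, w):
    """Bregman divergence of the power entropy with exponent 3."""
    return float(np.sum((p ** 3 - q ** 3 - 3.0 * q ** 2 * (p - q)) * w))

w = np.array([0.5, 1.0, 1.5])
p = np.array([1.0, 2.0, 3.0])
q = np.array([2.0, 2.0, 1.0])
params = dict(alpha=0.7, beta=0.3,
              a=np.array([1.0, 2.0, 3.0]),
              b=np.array([1.0, -1.0, 2.0]),
              c=np.array([0.5, 0.0, -1.0]),
              const=2.0)

quadratic_gap = quadratic_bregman(p, q, w, **params) - quadratic_bregman(q, p, w, **params)
cubic_gap = cubic_bregman(p, q, w) - cubic_bregman(q, p, w)
```

Here `quadratic_gap` vanishes up to rounding, because the divergence reduces to $\alpha\,(p-q) \cdot a(p-q) + \beta\,((p-q) \cdot b)^2$, whereas `cubic_gap` equals $\sum_i (q_i - p_i)^3 w_i \neq 0$.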